Learning objectives being assessed

Instructions

Marks

Exercises

1. (4pts) This question examines the boundaries for different models.

The lecture notes provide code to plot the QDA boundary between regions 2 and 3 of the olive oils data, using linoleic and arachidic acid. Use the same split of training and testing sets as used in lecture notes.

The dot plot of the overall data, without standardisation:

(Figure: unstandardised olive data.)

Here we can see the plots of the olive_test and olive_train data after standardisation.

(Figures: olive test; olive train.)

a.(1pt) Write down the equation for the discriminant function for region 2 as specified by QDA, using values computed from the data. (Use standardised variables.) Also, explain what assumption is not satisfied for QDA to be correctly applied for this data.

Under the QDA rule, the discriminant function is

$$\delta_k(x) = -\frac{1}{2}x^\top \Sigma_k^{-1} x + x^\top \Sigma_k^{-1}\mu_k - \frac{1}{2}\mu_k^\top \Sigma_k^{-1}\mu_k - \frac{1}{2}\log|\Sigma_k| + \log(\pi_k)$$

Equivalently, in terms of the sphered variable $y = A_k x$ (with $\eta_k = A_k \mu_k$),

$$\delta_k(y) = -\frac{1}{2}y^\top y + y^\top \eta_k - \frac{1}{2}\eta_k^\top \eta_k + \log(\pi_k)$$

From the data output shown below, the discriminant function for region 2 is

$$\delta_2(x) = -\frac{1}{2}x^\top \Sigma_2^{-1} x + x^\top \Sigma_2^{-1}\mu_2 - \frac{1}{2}\mu_2^\top \Sigma_2^{-1}\mu_2 - \frac{1}{2}\log|\Sigma_2| + \log(\pi_2)$$

where, from the following output, the values are

$$\pi_2 = 0.3952096, \qquad \log(\pi_2) = \log(0.3952096) \approx -0.9283$$

$$\mu_2 = \begin{bmatrix} 0.7483237 \\ 1.0746306 \end{bmatrix}, \qquad
\Sigma_2 = \begin{bmatrix} 0.178646453 & -0.001300898 \\ -0.001300898 & 0.169523093 \end{bmatrix}$$

$$\Sigma_2^{-1} = \begin{bmatrix} 5.59796105 & 0.04295803 \\ 0.04295803 & 5.89923100 \end{bmatrix}, \qquad
\log|\Sigma_2| = -3.497169$$

For QDA to be correctly applied, the observations in each class should follow a multivariate normal distribution with a class-specific mean and class-specific variance-covariance matrix. (Unlike LDA, QDA does not require equal variance-covariance across classes, which is why it can accommodate the unequal covariances seen here.) For this data, the two variables are not normally distributed within each class; instead the distributions are multimodal. Thus the olive data does not satisfy the normality assumption.

# Call:
# qda(region ~ ., data = olive_train)
# 
# Prior probabilities of groups:
#         2         3 
# 0.3952096 0.6047904 
# 
# Group means:
#    arachidic   linoleic
# 2  0.7483237  1.0746306
# 3 -0.4893105 -0.6861894
#              arachidic     linoleic
# arachidic  0.178646453 -0.001300898
# linoleic  -0.001300898  0.169523093
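As a sanity check, the quantities above can be recomputed from the reported class covariance. A minimal R sketch follows; the matrix and vector values are copied from the qda output, and the function name `delta2` is my own:

```r
# Recompute the pieces of the region-2 discriminant from Sigma_2.
Sigma2 <- matrix(c(0.178646453, -0.001300898,
                   -0.001300898, 0.169523093), nrow = 2, byrow = TRUE)
mu2 <- c(0.7483237, 1.0746306)
pi2 <- 0.3952096

Sigma2_inv <- solve(Sigma2)  # inverse covariance, ~ [[5.598, 0.043], [0.043, 5.899]]
logdet2 <- log(det(Sigma2))  # log|Sigma_2|, ~ -3.497

# The discriminant function evaluated at a standardised observation x:
delta2 <- function(x) {
  drop(-0.5 * t(x) %*% Sigma2_inv %*% x +
         t(x) %*% Sigma2_inv %*% mu2 -
         0.5 * t(mu2) %*% Sigma2_inv %*% mu2 -
         0.5 * logdet2 + log(pi2))
}
```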

b.(1pt) Make similar plots that show the boundary between the two regions, that would be produced by:

(i) linear discriminant analysis,

# parsnip model object
# 
# Fit time:  21ms 
# Call:
# lda(region ~ ., data = data)
# 
# Prior probabilities of groups:
#         2         3 
# 0.3952096 0.6047904 
# 
# Group means:
#    arachidic   linoleic
# 2  0.7483237  1.0746306
# 3 -0.4893105 -0.6861894
# 
# Coefficients of linear discriminants:
#                  LD1
# arachidic -0.9182956
# linoleic  -2.0375057
The confusion matrix of the LDA model on the training data:

| .pred_class | 2 | 3 |
|---|---|---|
| 2 | 66 | 2 |
| 3 | 0 | 99 |

The confusion matrix of the LDA model on the olive test data:

| .pred_class | 2 | 3 |
|---|---|---|
| 2 | 32 | 0 |
| 3 | 0 | 50 |

The following are the test metrics:

| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 1 |
| kap | binary | 1 |

The plot below shows the boundary under the LDA prediction method:

We plot the boundary using the training data because the training set has more observations than the test set, so the boundary estimated from it is more precise.

(ii)* classification tree (using minsplit of 10) using the rpart engine.

Here we use the ‘rpart’ engine to construct the tree. From the following output we can see that there are 167 observations at the root, separated into 2 classes. Class 2 contains 66 observations, with splitting criterion linoleic >= 0.5366322; class 3 contains 101 observations, with linoleic < 0.5366322.

# n= 167 
# 
# node), split, n, loss, yval, (yprob)
#       * denotes terminal node
# 
# 1) root 167 66 3 (0.3952096 0.6047904)  
#   2) linoleic>=0.5366322 66  0 2 (1.0000000 0.0000000) *
#   3) linoleic< 0.5366322 101  0 3 (0.0000000 1.0000000) *

In the graph below, the green line is the boundary produced by the classification tree method with the rpart engine; it separates the two classes well.

c. (1pt) Compute the test balanced accuracy for the three models (i) LDA,(ii) QDA, (iii) classification tree. Are they equally as accurate?

Below are the test-set predictions of the three methods:

#    truth predict_lda predict_tree predict_qda
# 33     3           3            3           3
# 34     3           3            3           3
# 35     3           3            3           3
# 36     3           3            3           3
# 37     3           3            3           3
# 38     3           3            3           3
# 39     3           3            3           3
# 40     3           3            3           3
# 41     3           3            3           3
# 42     3           3            3           3
# 43     3           3            3           3
# 44     3           3            3           3
# 45     3           3            3           3
# 46     3           3            3           3
# 47     3           3            3           3
# 48     3           3            3           3
# 49     3           3            3           3
# 50     3           3            3           3
# 51     3           3            3           3
# 52     3           3            3           3
# 53     3           3            3           3
# 54     3           3            3           3
# 55     3           3            3           3
# 56     3           3            3           3
# 57     3           3            3           3
# 58     3           3            3           3
# 59     3           3            3           3
# 60     3           3            3           3
# 61     3           3            3           3
# 62     3           3            3           3
# 63     3           3            3           3
# 64     3           3            3           3
# 65     3           3            3           3
# 66     3           3            3           3
# 67     3           3            3           3
# 68     3           3            3           3
# 69     3           3            3           3
# 70     3           3            3           3
# 71     3           3            3           3
# 72     3           3            3           3
# 73     3           3            3           3
# 74     3           3            3           3
# 75     3           3            3           3
# 76     3           3            3           3
# 77     3           3            3           3
# 78     3           3            3           3
# 79     3           3            3           3
# 80     3           3            3           3
# 81     3           3            3           3
# 82     3           3            3           3

There are 32 rows in total:

# [1] 32

Then we calculate the balanced accuracy as follows:

# [1] 1
# [1] 1
# [1] 1

All of them are equally accurate:

$$\text{BalAcc}_{lda} = \text{BalAcc}_{qda} = \text{BalAcc}_{tree} = 1$$

This means all three models have performed well in separating the two classes, as the balanced accuracy for each model equals 1 (100%).
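For reference, balanced accuracy can be computed directly from a confusion matrix. A minimal R sketch (the function name and matrix layout are my own, with rows as predictions and columns as truth, as in the tables above):

```r
# Balanced accuracy: the average of the per-class recalls.
bal_accuracy_2x2 <- function(cm) {
  recall_1 <- cm[1, 1] / sum(cm[, 1])  # recall for the first class
  recall_2 <- cm[2, 2] / sum(cm[, 2])  # recall for the second class
  (recall_1 + recall_2) / 2
}

# LDA confusion matrix on the olive test set, copied from above:
cm_lda <- matrix(c(32, 0,
                   0, 50), nrow = 2, byrow = TRUE)
bal_accuracy_2x2(cm_lda)  # 1
```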

d. (1pt) Re-fit the three models with the additional variable oleic (so you now have three predictors). Write a paragraph discussing how the models change, and why this might be.

Here, first of all, we take a look at the confusion matrices of all three models.

Here we re-fit the LDA model with oleic as follows:

# Call:
# lda(region ~ arachidic + linoleic, data = olive_train1)
# 
# Prior probabilities of groups:
#         2         3 
# 0.3952096 0.6047904 
# 
# Group means:
#    arachidic   linoleic
# 2  0.7483237  1.0746306
# 3 -0.4893105 -0.6861894
# 
# Coefficients of linear discriminants:
#                  LD1
# arachidic -0.9182956
# linoleic  -2.0375057

We also re-fit the QDA model with the variable oleic:

# Call:
# qda(region ~ arachidic + linoleic, data = olive_train1)
# 
# Prior probabilities of groups:
#         2         3 
# 0.3952096 0.6047904 
# 
# Group means:
#    arachidic   linoleic
# 2  0.7483237  1.0746306
# 3 -0.4893105 -0.6861894

We re-fit the tree model after adding oleic:

# n= 167 
# 
# node), split, n, loss, yval, (yprob)
#       * denotes terminal node
# 
# 1) root 167 66 3 (0.3952096 0.6047904)  
#   2) linoleic>=0.5366322 66  0 2 (1.0000000 0.0000000) *
#   3) linoleic< 0.5366322 101  0 3 (0.0000000 1.0000000) *

The confusion matrix and balanced accuracy for LDA:

#           Truth
# Prediction  2  3
#          2 32  0
#          3  0 50

| .metric | .estimator | .estimate |
|---|---|---|
| bal_accuracy | binary | 1 |

The confusion matrix and balanced accuracy for QDA:

#           Truth
# Prediction  2  3
#          2 31  0
#          3  1 50

| .metric | .estimator | .estimate |
|---|---|---|
| bal_accuracy | binary | 0.984375 |

Comparing the three models on balanced accuracy after adding the extra variable ‘oleic’: the LDA and tree models have the same accuracy as without ‘oleic’, so these two models remain just as accurate for the new model. However, the balanced accuracy of QDA changed from 1 to 0.984375. This is still high, but compared with the previous fit, adding the extra variable decreased the balanced accuracy of the QDA model by 1 − 0.984375 = 0.015625.

2. (6pts) This question examines a classification model equation

The palmerpenguins is a new R data package, with interesting measurements on penguins of three different species. Subset the data to contain just the Adelie and Gentoo species, and only the variables species and the four physical size measurement variables.

a. (1pt) Make a scatterplot matrix of the data, with species mapped to colour. Which variables would you expect to be the most important for distinguishing between the species? Is it fair to assume homogeneous variance-covariances?

From the scatterplot matrix, we look for variables in which the two species form well-separated groups; we also examine how strongly the variables are correlated within each group.

# # A tibble: 6 x 5
#   species     bl    bd     fl     bm
#   <fct>    <dbl> <dbl>  <dbl>  <dbl>
# 1 Adelie  -0.693 0.926 -1.41  -0.680
# 2 Adelie  -0.616 0.280 -1.08  -0.620
# 3 Adelie  -0.462 0.578 -0.477 -1.28 
# 4 Adelie  -1.16  1.22  -0.610 -1.04 
# 5 Adelie  -0.655 1.87  -0.809 -0.799
# 6 Adelie  -0.732 0.479 -1.41  -0.829
# [1] 274   5

# Levene's Test for Homogeneity of Variance (center = median)
#        Df F value Pr(>F)
# group   1  1.6441 0.2009
#       272
# Levene's Test for Homogeneity of Variance (center = median)
#        Df F value  Pr(>F)  
# group   1  3.5213 0.06166 .
#       272                  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Levene's Test for Homogeneity of Variance (center = median)
#        Df F value Pr(>F)
# group   1   0.157 0.6922
#       272
# Levene's Test for Homogeneity of Variance (center = median)
#        Df F value Pr(>F)
# group   1   1.993 0.1592
#       272

By examining the quantiles, we can see that bl and fl separate the two species well.

We can get a more detailed look by applying a “featurePlot” analysis to the quantitative variables across the two species.

We clearly observe that the medians and interquartile ranges of the two penguin species are quite separated, though by different features in each species. By separation we mean that the median and interquartile range of one class are distinct and do not overlap with the other class.

Here the variables bl and fl are estimated to separate the data well.

bl

From the scatterplots, the data points are well separated in any combination that contains bl, which means bill length may differ significantly between the two species.

Moreover, from the bar plot, the overlap between the two species on bl is also small, which means the two classes are well separated on the bl criterion.

fl

For fl, the boxplot shows the lowest overlap among all four variables, meaning there is a large difference between the species. The scatterplots for fl are not quite as good as those for bl, but since only very few points are mixed, we can still tell the two classes apart.

It is fair to assume homogeneous variance-covariance for each group. Homogeneous variance-covariance means that each group has the same shape: the variance can differ in different projections, but it is the same for each group. According to the plots, the point clouds for the two species (the Adelie-only data and the Gentoo-only data) have essentially the same shape, and the Levene's tests above do not reject equality of variances for any of the four variables at the 5% level.
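The Levene's test statistics above can be reproduced in base R. A minimal sketch of the median-centred (Brown-Forsythe) version, where the function name `levene_bf` is my own:

```r
# Median-centred Levene's test: a one-way ANOVA on the absolute
# deviations of each observation from its group median.
levene_bf <- function(x, group) {
  group <- as.factor(group)
  z <- abs(x - ave(x, group, FUN = median))
  anova(lm(z ~ group))
}

# Toy usage with simulated data (illustration only, not the penguin data):
set.seed(1)
out <- levene_bf(c(rnorm(50), rnorm(50, sd = 2)), rep(c("A", "B"), each = 50))
```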

b. (1pt) Break the data into training and test sets. Fit an LDA model to the data, assuming equal prior probabilities for the two groups. Report confusion table and the misclassification error for the test set.

# Call:
# lda(species ~ ., data = penguin_train, prior = c(0.5, 0.5))
# 
# Prior probabilities of groups:
# Adelie Gentoo 
#    0.5    0.5 
# 
# Group means:
#                bl         bd         fl         bm
# Adelie -0.7748322  0.7029637 -0.8448656 -0.7544736
# Gentoo  0.9019430 -0.9331506  0.9727739  0.8715777
# 
# Coefficients of linear discriminants:
#           LD1
# bl  0.6885892
# bd -1.9509187
# fl  1.4069513
# bm  0.9037988

Confusion matrix:

# Confusion Matrix and Statistics
# 
#           Reference
# Prediction Adelie Gentoo
#     Adelie     50      0
#     Gentoo      0     41
#                                      
#                Accuracy : 1          
#                  95% CI : (0.9603, 1)
#     No Information Rate : 0.5495     
#     P-Value [Acc > NIR] : < 2.2e-16  
#                                      
#                   Kappa : 1          
#                                      
#  Mcnemar's Test P-Value : NA         
#                                      
#             Sensitivity : 1.0000     
#             Specificity : 1.0000     
#          Pos Pred Value : 1.0000     
#          Neg Pred Value : 1.0000     
#              Prevalence : 0.5495     
#          Detection Rate : 0.5495     
#    Detection Prevalence : 0.5495     
#       Balanced Accuracy : 1.0000     
#                                      
#        'Positive' Class : Adelie     
# 

Here the model appears to be accurate: the overall accuracy is 1 and kappa is 1 on the test data.

Computing the misclassification error for the test set, every observation is classified correctly, so the error is 0:

# [1] 0

c. (1pt) Write down the LDA rule, for classifying the two species, explicitly being clear about which species is class 1 and which is class 2.

According to the LDA rule, a new observation $x_0$ belongs to class 1 if

$$\underbrace{x_0^\top \Sigma^{-1}(\mu_1 - \mu_2)}_{\text{dimension reduction}} > \frac{1}{2}(\mu_1 + \mu_2)^\top \Sigma^{-1}(\mu_1 - \mu_2)$$

In LDA, classes 1 and 2 need to be mapped to the data: the class that sits to the right on the reduced dimension is class 1 in the equation. After comparing the values for the Adelie and Gentoo penguins, Adelie should be class 1 in this question, i.e. $x_0$ is classified as Adelie if the rule above is satisfied. The group means on the discriminant, and the relevant mean differences, are displayed here:

#              LD1
# Adelie -3.775543
# Gentoo  4.597946
# # A tibble: 4 x 2
#   variables `Different mean between group -`
#   <chr>                                <dbl>
# 1 BL                                   -1.68
# 2 BD                                    1.64
# 3 FL                                   -1.82
# 4 BM                                   -1.63
# # A tibble: 4 x 2
#   variables `Different mean between group +`
#   <chr>                                <dbl>
# 1 BL                                   0.127
# 2 BD                                  -0.230
# 3 FL                                   0.128
# 4 BM                                   0.117

d. (1pt) Report the group means and the pooled variance-covariance matrix, and show how the linear discriminant space is computed from these. (Hint: You can use the matlib package to compute matrix inverses, see help here.)

We select the group means:

#                bl         bd         fl         bm
# Adelie -0.7748322  0.7029637 -0.8448656 -0.7544736
# Gentoo  0.9019430 -0.9331506  0.9727739  0.8715777

Here we calculate the variance-covariance matrix for each group.

The variance-covariance matrix of group Adelie:

#            bl         bd         fl         bm
# bl 0.22524992 0.12158571 0.06008633 0.12070965
# bd 0.12158571 0.37552815 0.07290338 0.18188810
# fl 0.06008633 0.07290338 0.17334046 0.09076054
# bm 0.12070965 0.18188810 0.09076054 0.25946180

The variance-covariance matrix of group Gentoo:

#           bl        bd        fl        bm
# bl 0.3423178 0.1845119 0.1516292 0.2441749
# bd 0.1845119 0.2589256 0.1462688 0.2292669
# fl 0.1516292 0.1462688 0.1613951 0.1782303
# bm 0.2441749 0.2292669 0.1782303 0.3879481

Then, with the numbers of Adelie and Gentoo observations in the training set as weights, we substitute into the pooled variance formula

$$\Sigma_{pooled} = \frac{(n_1 - 1)\Sigma_{Adelie} + (n_2 - 1)\Sigma_{Gentoo}}{n_1 + n_2 - 2}$$

Multiplying the inverse of the pooled variance-covariance matrix by the difference of the group means, $\Sigma_{pooled}^{-1}(\mu_{Adelie} - \mu_{Gentoo})$, gives the (unnormalised) linear discriminant direction:

#          [,1]
# bl  -5.765894
# bd  16.335996
# fl -11.781091
# bm  -7.567949

Note this vector is proportional to the LD1 coefficients reported in part b (each entry is about −8.37 times the corresponding coefficient).
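The computation above can be sketched in R. The covariance matrices and group means are copied from the outputs above; the group sizes `n1` and `n2` are placeholders, since the actual training counts are not shown:

```r
# Pool the two group covariance matrices and compute the (unnormalised)
# discriminant direction.
S_adelie <- matrix(c(0.22524992, 0.12158571, 0.06008633, 0.12070965,
                     0.12158571, 0.37552815, 0.07290338, 0.18188810,
                     0.06008633, 0.07290338, 0.17334046, 0.09076054,
                     0.12070965, 0.18188810, 0.09076054, 0.25946180), nrow = 4)
S_gentoo <- matrix(c(0.3423178, 0.1845119, 0.1516292, 0.2441749,
                     0.1845119, 0.2589256, 0.1462688, 0.2292669,
                     0.1516292, 0.1462688, 0.1613951, 0.1782303,
                     0.2441749, 0.2292669, 0.1782303, 0.3879481), nrow = 4)
mu_adelie <- c(-0.7748322, 0.7029637, -0.8448656, -0.7544736)
mu_gentoo <- c(0.9019430, -0.9331506, 0.9727739, 0.8715777)
n1 <- 100; n2 <- 80  # placeholder group sizes

S_pooled <- ((n1 - 1) * S_adelie + (n2 - 1) * S_gentoo) / (n1 + n2 - 2)
d <- solve(S_pooled) %*% (mu_adelie - mu_gentoo)  # discriminant direction
```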

e. (1pt) Make a plot showing the data in the discriminant space, to examine how well the species are separated.

Now we make a plot of the data in the discriminant space.

#    species         bl                 bd                 fl          
#  Adelie:50   Min.   :-2.04076   Min.   :-1.65696   Min.   :-1.73970  
#  Gentoo:41   1st Qu.:-0.78007   1st Qu.:-0.91195   1st Qu.:-0.80934  
#              Median :-0.01981   Median : 0.13106   Median :-0.07834  
#              Mean   : 0.04724   Mean   : 0.06065   Mean   : 0.06114  
#              3rd Qu.: 0.76933   3rd Qu.: 0.87606   3rd Qu.: 0.95170  
#              Max.   : 2.54007   Max.   : 2.16740   Max.   : 1.84884  
#        bm                LD1        
#  Min.   :-1.75620   Min.   :-6.720  
#  1st Qu.:-0.85900   1st Qu.:-4.298  
#  Median : 0.06811   Median :-2.680  
#  Mean   : 0.05201   Mean   :-0.364  
#  3rd Qu.: 0.81578   3rd Qu.: 4.261  
#  Max.   : 1.95223   Max.   : 6.300

Using a density plot of the discriminant scores, we can also see that the two density curves do not intersect, which means the linear discriminant method separates the two classes well.

f. (1pt) Return to the data with the original three species of penguins. Conduct LDA for the three groups, and obtain the 2D discriminant space, using just the same four physical measurements. Using the tour, determine which of the variables are most important for distinguishing Chinstrap penguins from Adelie. Provide a plot or two to support your decision.

# # A tibble: 6 x 5
#   species     bl    bd     fl     bm
#   <fct>    <dbl> <dbl>  <dbl>  <dbl>
# 1 Adelie  -0.883 0.784 -1.42  -0.563
# 2 Adelie  -0.810 0.126 -1.06  -0.501
# 3 Adelie  -0.663 0.430 -0.421 -1.19 
# 4 Adelie  -1.32  1.09  -0.563 -0.937
# 5 Adelie  -0.847 1.75  -0.776 -0.688
# 6 Adelie  -0.920 0.329 -1.42  -0.719
# [1] 342   5

**Box plots and density plots can clearly show the relationship between the four variables and the species, and make it easy to tell which variables are important for distinguishing them.**

Tour of the four measurement dimensions:

# parsnip model object
# 
# Fit time:  20ms 
# Call:
# lda(species ~ ., data = data, prior = ~c(1/3, 1/3, 1/3))
# 
# Prior probabilities of groups:
#    Adelie Chinstrap    Gentoo 
# 0.3333333 0.3333333 0.3333333 
# 
# Group means:
#           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
# Adelie          38.85545      18.40099          189.6238    3729.455
# Chinstrap       48.46957      18.34783          195.6522    3688.043
# Gentoo          47.67073      14.96707          217.1220    5049.085
# 
# Coefficients of linear discriminants:
#                            LD1          LD2
# bill_length_mm     0.037062318 -0.413284602
# bill_depth_mm     -1.020639811  0.092714992
# flipper_length_mm  0.092626372 -0.002792050
# body_mass_g        0.001515876  0.001782931
# 
# Proportion of trace:
#   LD1   LD2 
# 0.835 0.165
(Animated tour frames of bl, bd, fl and bm, coloured by species.)

Here we can conclude, after taking Chinstrap into consideration, that bd makes it easy to separate Gentoo from Adelie and Chinstrap; the value of bm can also separate Gentoo from Adelie and Chinstrap.

To choose which variables are useful for separating Chinstrap from Adelie, we should focus on the tour. In the tour above, the green points are Adelie, whereas the orange points are Chinstrap. From the tour we can tell that bl is the most important variable: the green and orange groups separate mainly in projections with a large contribution from bl. Based on this evidence, we can say bl carries the important information for distinguishing Chinstrap.

3. (5pts) This question examines impurity metrics for a classification tree

Impurity metrics can be other than Gini or entropy. The following metric, proposed by Buja and Lee (2001), is called a one-sided extreme:

$$OSE = 1 - \max(\hat{p}_{L1}, \hat{p}_{R1})$$

and is designed to do as well as possible in splitting off pure subsets of class 1. That is, it might help in a problem where one class is very important to predict well.

We are going to use this metric to build a spam filter for the spam data used in tutorial 4.

a. (0.5pt) Read in the spam data, set the levels for day of the week, and filter to the five most common domains, “com”, “edu”, “net”, “org”, “gov”. Drop variable spampct, because there are many missing values. (This variable was one that some of the mail software provided as a probability that the email was spam. Most mail software didn’t provide this.) Class 1 for this data would be spam=no. Explain why.

According to the rule of thumb, the more important of the two classes is made class 1. Since the task is to filter out spam, predicting ‘spam=no’ (legitimate mail) well is what matters most, and the OSE metric splits off pure subsets of class 1; thus we put spam=no into class 1.

Below is the spam data we will use later, after removing the spampct column:

# # A tibble: 1,998 x 20
#    isuid    id `day of week` `time of day` size.kb box   domain local digits
#    <dbl> <dbl> <fct>                 <dbl>   <dbl> <chr> <chr>  <chr>  <dbl>
#  1     1     1 Thu                       0       7 no    com    no         0
#  2     1     2 Thu                       0       2 no    com    no         0
#  3     1     3 Thu                      14       3 no    edu    yes        0
#  4     1     9 Thu                       6       3 yes   edu    yes        0
#  5     1    11 Thu                       7       3 no    com    no         0
#  6     1    13 Thu                       8      12 yes   com    no         0
#  7     1    14 Thu                       8      12 yes   com    no         0
#  8     1    16 Thu                       9       2 yes   edu    yes        0
#  9     1    19 Thu                      10       2 no    edu    yes        0
# 10     1    23 Thu                      12       3 no    edu    yes        0
# # ... with 1,988 more rows, and 11 more variables: name <chr>, cappct <dbl>,
# #   special <dbl>, credit <chr>, sucker <chr>, porn <chr>, chain <chr>,
# #   username <chr>, large text <chr>, category <chr>, spam <chr>

b. (1pt) Considering only the domain variable, what would be all of the possible splits? Compute OSE for each possible split, and report the best split, by hand.

All possible splits are shown below. With five domains there are 15 distinct splits (rows) into a left and a right bucket; each left bucket is paired with the complementary right bucket, so together they make up the original five domains.
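The enumeration can be sketched in R with `combn` (the object names are my own):

```r
# Enumerate all (2^5 - 2) / 2 = 15 splits of five domains into two buckets:
# left buckets of size 1 or 2, right buckets being the complements.
domains <- c("com", "edu", "net", "org", "gov")
left  <- unlist(lapply(1:2, function(k) combn(domains, k, simplify = FALSE)),
                recursive = FALSE)
right <- lapply(left, function(s) setdiff(domains, s))
length(left)  # 15
```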

#       [,1] [,2]
#  [1,]    1    2
#  [2,]    1    3
#  [3,]    1    4
#  [4,]    1    5
#  [5,]    1    2
#  [6,]    2    3
#  [7,]    2    4
#  [8,]    2    5
#  [9,]    1    3
# [10,]    2    3
# [11,]    3    4
# [12,]    3    5
# [13,]    1    4
# [14,]    2    4
# [15,]    3    4
# [16,]    4    5
# [17,]    1    5
# [18,]    2    5
# [19,]    3    5
# [20,]    4    5
# [1] 807
# [1] 1998

The right buckets will be:

# [[1]]
# [1] "edu" "net" "org" "gov"
# 
# [[2]]
# [1] "com" "net" "org" "gov"
# 
# [[3]]
# [1] "com" "edu" "org" "gov"
# 
# [[4]]
# [1] "com" "edu" "net" "gov"
# 
# [[5]]
# [1] "com" "edu" "net" "org"
# 
# [[6]]
# [1] "net" "org" "gov"
# 
# [[7]]
# [1] "edu" "org" "gov"
# 
# [[8]]
# [1] "edu" "net" "gov"
# 
# [[9]]
# [1] "edu" "net" "org"
# 
# [[10]]
# [1] "com" "org" "gov"
# 
# [[11]]
# [1] "com" "net" "gov"
# 
# [[12]]
# [1] "com" "net" "org"
# 
# [[13]]
# [1] "com" "edu" "gov"
# 
# [[14]]
# [1] "com" "edu" "org"
# 
# [[15]]
# [1] "com" "edu" "net"

The left buckets will be:

# [[1]]
# [1] "com"
# 
# [[2]]
# [1] "edu"
# 
# [[3]]
# [1] "net"
# 
# [[4]]
# [1] "org"
# 
# [[5]]
# [1] "gov"
# 
# [[6]]
# [1] "com" "edu"
# 
# [[7]]
# [1] "com" "net"
# 
# [[8]]
# [1] "com" "org"
# 
# [[9]]
# [1] "com" "gov"
# 
# [[10]]
# [1] "edu" "net"
# 
# [[11]]
# [1] "edu" "org"
# 
# [[12]]
# [1] "edu" "gov"
# 
# [[13]]
# [1] "net" "org"
# 
# [[14]]
# [1] "net" "gov"
# 
# [[15]]
# [1] "org" "gov"

Then the OSE for each split will be:

# # A tibble: 15 x 2
#    index     ose
#    <dbl>   <dbl>
#  1     1 0.694  
#  2     2 0.00868
#  3     3 0.729  
#  4     4 0.154  
#  5     5 0.1    
#  6     6 0.309  
#  7     7 0.698  
#  8     8 0.677  
#  9     9 0.687  
# 10    10 0.0823 
# 11    11 0.0122 
# 12    12 0.00955
# 13    13 0.625  
# 14    14 0.680  
# 15    15 0.139

Among all 15 splits in this list, the 2nd has the smallest OSE, which is 0.00868.

We should therefore choose split 2 as the best split. In this case, from the bucket lists above, the left bucket contains only the domain edu, and the right bucket contains the domains com, net, org and gov.

c. (1.5pt) Write a function to compute OSE, given a numeric variable. It needs to have an option for minimum split, so that you can restrict the minimum size for each subset. At what value of size.kb would the split be made? (Using a minimum split value of 10.)

#      optimal_ose optimal_ose_v
# [1,]           0           113

After calling the function we wrote, we find that when size.kb equals 113, the OSE (with minsplit = 10) is the smallest among all candidate splits.

The reported optimal_ose is 0 (decimals beyond those displayed may be non-zero), so we conclude the split would be made at size.kb = 113.
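A function like the one part c asks for can be sketched as follows. This is a minimal illustration, not the assignment's own code; the names `ose_split`, `x` and `cl1` are my own:

```r
# One-sided extreme for splits on a numeric variable.
# x: numeric predictor; cl1: logical, TRUE for class-1 observations
# (here, spam = "no"); minsplit: minimum size allowed for each bucket.
ose_split <- function(x, cl1, minsplit = 10) {
  best <- c(ose = Inf, value = NA)
  for (v in sort(unique(x))) {
    left <- x <= v
    if (sum(left) < minsplit || sum(!left) < minsplit) next
    pL1 <- mean(cl1[left])    # proportion of class 1 in the left bucket
    pR1 <- mean(cl1[!left])   # proportion of class 1 in the right bucket
    ose <- 1 - max(pL1, pR1)
    if (ose < best[["ose"]]) best <- c(ose = ose, value = v)
  }
  best
}

# Toy usage: a perfectly separable variable gives OSE = 0.
ose_split(1:20, (1:20) <= 12, minsplit = 5)
```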

d. (1pt) Use your function to compute OSE for each possible split for the other numeric variables (time of day, digits, cappct, special). Which of these would produce the best split? And what is the split?

$$OSE = 1 - \max(\hat{p}_{L1}, \hat{p}_{R1})$$

Below are the columns names of the spam data.

#  [1] "isuid"       "id"          "day of week" "time of day" "size.kb"    
#  [6] "box"         "domain"      "local"       "digits"      "name"       
# [11] "cappct"      "special"     "credit"      "sucker"      "porn"       
# [16] "chain"       "username"    "large text"  "category"    "spam"

We use the ose function to calculate OSE for variable “time of the day”

#      optimal_ose optimal_ose_v
# [1,]   0.2776457             7

We use the function to calculate the variable “special”

#      optimal_ose optimal_ose_v
# [1,]   0.2638581             0

Here, we use the function to calculate the OSE of variable “digits”

#      optimal_ose optimal_ose_v
# [1,]   0.2657224             0

Here, we use the function to calculate the OSE of variable “cappct”

#      optimal_ose optimal_ose_v
# [1,]    0.273913             0

From the calculations above, applying the function we wrote, we can conclude that the best split is produced by the variable special, with an optimal OSE of 0.2638581. Since no further decimals are kept and special only takes integer values, the split for special is made at special = 0, where the OSE is 0.2638581.

e. (1pt) Make a plot of the Gini measure for impurity against all values of p (0-1), overlaying the OSE index. Compare the two measures, based on the worst and best p for each.

The Gini measure formula is

$$G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$$

In this question $\hat{p}_{mk} = p$, and thus $(1 - \hat{p}_{mk}) = (1 - p)$: if $p$ is the left bucket proportion then $1 - p$ is the right bucket proportion, and vice versa.

For OSE the formula is

$$OSE = 1 - \max(\hat{p}_{L1}, \hat{p}_{R1})$$

For the Gini index, a lower value indicates a better split. From the following graph we can conclude that p = 0.5 is the worst case, while at p = 0 or p = 1 the Gini index performs best.

For OSE, a lower value likewise means a better split. When p = 0 the OSE is very small, which is the optimal split; when p = 1 the value of OSE is high, which is the worst split.
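One way to draw the two curves in base R is sketched below. This is an assumption about the plot's parametrisation: OSE is overlaid by treating p and 1−p as the class-1 proportions of the two buckets:

```r
# Two-class Gini impurity and the OSE index, as functions of p in [0, 1].
p <- seq(0, 1, by = 0.01)
gini <- 2 * p * (1 - p)        # maximised (worst) at p = 0.5
ose  <- 1 - pmax(p, 1 - p)     # 0 at p = 0 or 1, maximised at p = 0.5
plot(p, gini, type = "l", ylab = "impurity", main = "Gini vs OSE")
lines(p, ose, lty = 2)
```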

4. (5pts) Conducting and interpreting a PCA

Principal component analysis is often used to create indicator variables (see e.g. Constructing socio-economic status indices: how to use principal components analysis). In this question, you will look at the socioeconomic data provided on kaggle to create an indicator variable, using PCA.

a. (0.5pt) How many PCs are possible to compute on this data?

# spec_tbl_df[,10] [167 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#  $ country   : chr [1:167] "Afghanistan" "Albania" "Algeria" "Angola" ...
#  $ child_mort: num [1:167] 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
#  $ exports   : num [1:167] 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
#  $ health    : num [1:167] 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
#  $ imports   : num [1:167] 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
#  $ income    : num [1:167] 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
#  $ inflation : num [1:167] 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
#  $ life_expec: num [1:167] 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
#  $ total_fer : num [1:167] 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
#  $ gdpp      : num [1:167] 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
#  - attr(*, "spec")=
#   .. cols(
#   ..   country = col_character(),
#   ..   child_mort = col_double(),
#   ..   exports = col_double(),
#   ..   health = col_double(),
#   ..   imports = col_double(),
#   ..   income = col_double(),
#   ..   inflation = col_double(),
#   ..   life_expec = col_double(),
#   ..   total_fer = col_double(),
#   ..   gdpp = col_double()
#   .. )

We can conclude that 9 PCs are possible to compute, since there are 9 numeric variables (country is a character identifier).

b. (1pt) Compute a PCA. What proportion of variance does the first PC explain?

# Standard deviations (1, .., p=9):
# [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
# [8] 0.2971790 0.2586020
# 
# Rotation (n x k) = (9 x 9):
#                   PC1          PC2         PC3          PC4         PC5
# child_mort -0.4195194 -0.192883937  0.02954353 -0.370653262  0.16896968
# exports     0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
# health      0.1508378  0.243086779  0.59663237 -0.461897497 -0.51800037
# imports     0.1614824 -0.671820644  0.29992674  0.071907461 -0.25537642
# income      0.3984411 -0.022535530 -0.30154750 -0.392159039  0.24714960
# inflation  -0.1931729  0.008404473 -0.64251951 -0.150441762 -0.71486910
# life_expec  0.4258394  0.222706743 -0.11391854  0.203797235 -0.10821980
# total_fer  -0.4037290 -0.155233106 -0.01954925 -0.378303645  0.13526221
# gdpp        0.3926448  0.046022396 -0.12297749 -0.531994575  0.18016662
#                     PC6         PC7         PC8         PC9
# child_mort -0.200628153  0.07948854  0.68274306  0.32754180
# exports     0.059332832  0.70730269  0.01419742 -0.12308207
# health     -0.007276456  0.24983051 -0.07249683  0.11308797
# imports     0.030031537 -0.59218953  0.02894642  0.09903717
# income     -0.160346990 -0.09556237 -0.35262369  0.61298247
# inflation  -0.066285372 -0.10463252  0.01153775 -0.02523614
# life_expec  0.601126516 -0.01848639  0.50466425  0.29403981
# total_fer   0.750688748 -0.02882643 -0.29335267 -0.02633585
# gdpp       -0.016778761 -0.24299776  0.24969636 -0.62564572

The rotation matrix above confirms that all nine PCs can be computed from this data.

            PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8    PC9
Variance    4.1357 1.5463 1.1704 0.9948 0.6606 0.2236 0.1134 0.0883 0.0669
Proportion  0.4595 0.1718 0.1300 0.1105 0.0734 0.0248 0.0126 0.0098 0.0074
Cum. prop.  0.4595 0.6313 0.7614 0.8719 0.9453 0.9702 0.9828 0.9926 1.0000
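The variance and proportion rows follow directly from the standard deviations reported by prcomp: the variance of each PC is its squared standard deviation, and with nine standardised variables the variances sum to 9. A quick check:

```r
# Standard deviations of the nine PCs, copied from the prcomp output above
sdev <- c(2.0336314, 1.2435217, 1.0818425, 0.9973889, 0.8127847,
          0.4728437, 0.3368067, 0.2971790, 0.2586020)

variance   <- sdev^2                   # eigenvalues of the correlation matrix
proportion <- variance / sum(variance) # proportion of variance explained

round(variance[1], 4)    # 4.1357
round(proportion[1], 4)  # 0.4595
round(cumsum(proportion), 4)
```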

First principal component

The first principal component of a set of variables x1, x2, …, xp is the linear combination

z1 = ϕ11x1 + ϕ21x2 + ⋯ + ϕp1xp

Substituting the PC1 loadings from the rotation matrix above gives

PC1 = −0.4195194×child_mort + 0.2838970×exports + 0.1508378×health + 0.1614824×imports + 0.3984411×income − 0.1931729×inflation + 0.4258394×life_expec − 0.4037290×total_fer + 0.3926448×gdpp

The summary of the PCA is:

# Importance of components:
#                           PC1    PC2    PC3    PC4    PC5     PC6    PC7
# Standard deviation     2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
# Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
# Cumulative Proportion  0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
#                            PC8     PC9
# Standard deviation     0.29718 0.25860
# Proportion of Variance 0.00981 0.00743
# Cumulative Proportion  0.99257 1.00000

According to the summary above, the proportion of variance explained by the first PC is 0.4595; that is, PC1 explains 45.95% of the total variation in the data.

c. (1pt) Examine the loading for the PC1. Make a plot of the loading (like done during tutorial).

The loadings of PC1 are plotted below. There is considerable variability in the loading coefficients, so a bootstrap is needed to assess which loadings are significantly different from zero and hence which variables are relevant.
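One way to make the loadings plot, sketched with ggplot2 (the PCA object name `country_pca` is an assumption):

```r
library(ggplot2)
library(tibble)

# Turn the rotation matrix into a data frame with a column of variable names
pc1_load <- rownames_to_column(as.data.frame(country_pca$rotation), "variable")

# Bar chart of the PC1 loadings, one bar per variable
ggplot(pc1_load, aes(x = variable, y = PC1)) +
  geom_col() +
  geom_hline(yintercept = 0) +
  coord_flip() +
  labs(x = "", y = "PC1 loading")
```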

d. (1pt) Use bootstrap to assess which variables could be considered unimportant for PC1 (ie loading not significantly different from 0).

According to the bootstrap confidence intervals, the loadings of child_mort, total_fer, income, life_expec and gdpp are significantly different from 0: their intervals lie entirely on one side of 0. In contrast, the intervals for exports, health, imports and inflation cross the 0 boundary, so these loadings are not significantly different from 0. These four variables can therefore be considered unimportant for PC1 and dropped before refitting.
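A sketch of the bootstrap using the boot package; `country_num` is an assumed name for the numeric data, and the loadings are sign-aligned because PCA loadings are only identified up to sign:

```r
library(boot)

# Statistic: the PC1 loadings of a PCA fitted to a bootstrap resample
pc1_loadings <- function(data, indices) {
  p <- prcomp(data[indices, ], scale. = TRUE)
  l <- p$rotation[, 1]
  if (l["life_expec"] < 0) l <- -l  # flip so life_expec always loads positively
  l
}

set.seed(2021)
bt <- boot(country_num, pc1_loadings, R = 1000)

# 95% percentile intervals for each loading; a loading whose interval
# contains 0 is not significantly different from 0
ci <- apply(bt$t, 2, quantile, probs = c(0.025, 0.975))
colnames(ci) <- names(bt$t0)
ci
```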

e. (0.5pt) Write down the formula for your new indicator variable. What would you consider it to be an indicator of? That is, interpret PC1 based on the loadings.

# # A tibble: 9 x 9
#      PC1      PC2     PC3      PC4     PC5      PC6     PC7     PC8     PC9
#    <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
# 1  0.426  0.223   -0.114   0.204   -0.108   0.601   -0.0185  0.505   0.294 
# 2  0.398 -0.0225  -0.302  -0.392    0.247  -0.160   -0.0956 -0.353   0.613 
# 3  0.393  0.0460  -0.123  -0.532    0.180  -0.0168  -0.243   0.250  -0.626 
# 4  0.284 -0.613   -0.145  -0.00309 -0.0576  0.0593   0.707   0.0142 -0.123 
# 5  0.161 -0.672    0.300   0.0719  -0.255   0.0300  -0.592   0.0289  0.0990
# 6  0.151  0.243    0.597  -0.462   -0.518  -0.00728  0.250  -0.0725  0.113 
# 7 -0.193  0.00840 -0.643  -0.150   -0.715  -0.0663  -0.105   0.0115 -0.0252
# 8 -0.404 -0.155   -0.0195 -0.378    0.135   0.751   -0.0288 -0.293  -0.0263
# 9 -0.420 -0.193    0.0295 -0.371    0.169  -0.201    0.0795  0.683   0.328

Having selected the useful variables, we construct a new PCA using only those variables.

# Standard deviations (1, .., p=5):
# [1] 1.9065314 0.9721904 0.4821562 0.3239670 0.2873232
# 
# Rotation (n x k) = (5 x 5):
#                   PC1        PC2           PC3        PC4        PC5
# child_mort -0.4654859 -0.4031338  1.944752e-01 -0.2900511  0.7062972
# income      0.4297577 -0.5358646  1.499573e-01  0.6779796  0.2145085
# life_expec  0.4790410  0.2223537 -6.274361e-01 -0.1626967  0.5485730
# total_fer  -0.4420473 -0.4036622 -7.389291e-01  0.1993868 -0.2363890
# gdpp        0.4168274 -0.5813329  1.490296e-05 -0.6244907 -0.3135575
# Importance of components:
#                          PC1    PC2     PC3     PC4     PC5
# Standard deviation     1.907 0.9722 0.48216 0.32397 0.28732
# Proportion of Variance 0.727 0.1890 0.04649 0.02099 0.01651
# Cumulative Proportion  0.727 0.9160 0.96250 0.98349 1.00000

The new formula for PC1 is

PC1 = −0.4654859×child_mort + 0.4297577×income + 0.4790410×life_expec − 0.4420473×total_fer + 0.4168274×gdpp

We choose PC1 as the indicator variable because, in the refitted PCA, it alone explains 72.7% of the total variance. Its loadings contrast income, life_expec and gdpp (positive) against child_mort and total_fer (negative), so PC1 can be interpreted as an indicator of a country's overall level of development.

The refitted PCA (pca4) therefore uses only the significant variables: child_mort, income, life_expec, total_fer and gdpp.
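The indicator value for each country is its PC1 score from pca4; a minimal sketch, where `countries` is an assumed name for the original data frame:

```r
library(dplyr)

# PC1 scores from the refitted PCA serve as the development indicator.
# `pca4` is the refitted PCA object; `countries` is the assumed data frame.
countries$indicator <- pca4$x[, 1]

# Countries with the highest and lowest indicator values
countries %>% arrange(desc(indicator)) %>% head()
countries %>% arrange(indicator) %>% head()
```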

f. (1pt) Make a biplot of the first two PCs. Which few countries have the highest values, and which few countries have the lowest value on PC1? What does it mean for a country to have a high value on PC1?

According to the biplot, Luxembourg, Singapore, Qatar and Malta are the four countries with the highest values on PC1, while Nigeria, Haiti, the Central African Republic and Chad are the four countries with the lowest. In macroeconomic terms, developed countries tend to have high PC1 values, while less developed countries tend to have low ones. The coloured plot also shows that high values of imports, exports, income, health, life_expec and gdpp increase PC1, whereas high values of child_mort, inflation and total_fer decrease it.

In summary, a country with a high value on PC1 has high imports, exports, income, health, life_expec and gdpp, low child_mort, inflation and total_fer, and tends to be more developed. A country with a low PC1 shows the opposite pattern and tends to be less developed.
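The biplot discussed above can be drawn with base R's biplot(); object names are assumptions carried over from the sketches above:

```r
# Biplot of the first two PCs, labelling observations by country name.
# `country_pca` and `countries` are assumed object names.
biplot(country_pca, xlabs = countries$country, cex = c(0.5, 0.8))
```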

Citation

e1071: David Meyer, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel and Friedrich Leisch (2021). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-6. https://CRAN.R-project.org/package=e1071

tidyverse: Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

GGally: Barret Schloerke, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg and Jason Crowley (2021). GGally: Extension to 'ggplot2'. R package version 2.1.1. https://CRAN.R-project.org/package=GGally

MASS: Venables, W. N. & Ripley, B. D. (2002). Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0

discrim: Max Kuhn (2020). discrim: Model Wrappers for Discriminant Analysis. R package version 0.1.1. https://CRAN.R-project.org/package=discrim

tidymodels: Kuhn et al. (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org

kableExtra: Hao Zhu (2021). kableExtra: Construct Complex Table with 'kable' and Pipe Syntax. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra

rpart: Terry Therneau and Beth Atkinson (2019). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-15. https://CRAN.R-project.org/package=rpart

rpart.plot: Stephen Milborrow (2020). rpart.plot: Plot 'rpart' Models: An Enhanced Version of 'plot.rpart'. R package version 3.0.9. https://CRAN.R-project.org/package=rpart.plot

palmerpenguins: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/

plotly: C. Sievert (2020). Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida.

tidyr: Hadley Wickham (2021). tidyr: Tidy Messy Data. R package version 1.1.3. https://CRAN.R-project.org/package=tidyr

tourr: Hadley Wickham, Dianne Cook, Heike Hofmann, Andreas Buja (2011). tourr: An R Package for Exploring Multivariate Data with Projections. Journal of Statistical Software, 40(2), 1-18. http://www.jstatsoft.org/v40/i02/

rsample: Julia Silge, Fanny Chow, Max Kuhn and Hadley Wickham (2021). rsample: General Resampling Infrastructure. R package version 0.0.9. https://CRAN.R-project.org/package=rsample

parsnip: Max Kuhn and Davis Vaughan (2021). parsnip: A Common API to Modeling and Analysis Functions. R package version 0.1.5. https://CRAN.R-project.org/package=parsnip

yardstick: Max Kuhn and Davis Vaughan (2021). yardstick: Tidy Characterizations of Model Performance. R package version 0.0.8. https://CRAN.R-project.org/package=yardstick

spinifex: Nicholas Spyrison and Dianne Cook (2021). spinifex: Manual Tours, Manual Control of Dynamic Projections of Numeric Multivariate Data. R package version 0.2.8. https://CRAN.R-project.org/package=spinifex

dplyr: Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.5. https://CRAN.R-project.org/package=dplyr

magrittr: Stefan Milton Bache and Hadley Wickham (2020). magrittr: A Forward-Pipe Operator for R. R package version 2.0.1. https://CRAN.R-project.org/package=magrittr

skimr: Elin Waring, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu and Shannon Ellis (2021). skimr: Compact and Flexible Summaries of Data. R package version 2.1.3. https://CRAN.R-project.org/package=skimr

caret: Max Kuhn (2020). caret: Classification and Regression Training. R package version 6.0-86. https://CRAN.R-project.org/package=caret

ggplot2: H. Wickham (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

ggrepel: Kamil Slowikowski (2021). ggrepel: Automatically Position Non-Overlapping Text Labels with 'ggplot2'. R package version 0.9.1. https://CRAN.R-project.org/package=ggrepel

klaR: Weihs, C., Ligges, U., Luebke, K. and Raabe, N. (2005). klaR Analyzing German Business Cycles. In Baier, D., Decker, R. and Schmidt-Thieme, L. (eds.). Data Analysis and Decision Support, 335-343, Springer-Verlag, Berlin.

knitr: Yihui Xie (2021). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.33.

Yihui Xie (2015). Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963

Yihui Xie (2014). knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595

boot: Angelo Canty and Brian Ripley (2021). boot: Bootstrap R (S-Plus) Functions. R package version 1.3-27.

Davison, A. C. & Hinkley, D. V. (1997). Bootstrap Methods and Their Applications. Cambridge University Press, Cambridge. ISBN 0-521-57391-2

pracma: Hans W. Borchers (2021). pracma: Practical Numerical Math Functions. R package version 2.3.3. https://CRAN.R-project.org/package=pracma

car: John Fox and Sanford Weisberg (2019). An R Companion to Applied Regression, Third Edition. Thousand Oaks CA: Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/

rattle: Williams, G. J. (2011). Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery, Use R!, Springer.
